Setup

1.1 Load packages

library(ggplot2)
library(scales)
library(corrplot)
library(fitdistrplus)
library(dplyr)
library(RColorBrewer)

1.2 Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.

  1. BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

    This is an observational study - only correlation can be evaluated.

  2. Survey design includes stratification alongside random sampling. Use of the final weight in analysis is necessary if users are to make generalizations from the sample to the population. I will skip it for now, so my results cannot be generalized to the population.

  3. Data was collected through a telephone-based survey, so it is non-response biased. (Schneider et al. 2010)

Age and ethnicity misrepresentation. Distribution of the sample by the calculated age groups (X_age_g) and sex (sex) filled by race (X_mrace1: Calculated Non-Hispanic Race Including Multiracial):

brfss2013 %>%
  filter(!is.na(sex)) %>% 
    ggplot(aes(x = X_age_g, fill = X_mrace1 )) +
      labs( x="age group", title = "Distribution by sex and age groups", 
            fill="Race" ) +
      theme_light() +
      theme(axis.text.x = element_text(size=11, angle = 90, vjust = 0.5, hjust=1)) +
      geom_bar() +
      facet_wrap(~sex) +
      scale_fill_brewer(palette = "Set2")

Indeed, age demographics is heavily skewed to the older age, especially in the female group. Race demographics skewed as well. As this is crucial for calculating this skewness should be addressed in detailed analysis later.

#cleaning data without age and sex
brfss2013 <- brfss2013 %>% 
  filter(!is.na(X_age_g), !is.na(sex) )

#weights here

#creating separate dataset for each research question
dat_q1<-brfss2013 
dat_q3<-brfss2013 

Part 2: Research questions

In the last centuries humanity eliminated numerous maladies and invented many new ones as well. One of them is obesity pandemic and diseases associated with it - diabetes type 2, cardiovascular diseases, cancer, non-alcoholic fatty liver disease and many other disorders (Grundy 2004; Haslam and James 2005; Bray 2004)

From CDC <https://www.cdc.gov/obesity/data/adult.html>

Obesity is a common, serious, and costly disease

  • The US obesity prevalence was 42.4% in 2017 – 2018.

  • From 1999 –2000 through 2017 –2018, US obesity prevalence increased from 30.5% to 42.4%. During the same time, the prevalence of severe obesity increased from 4.7% to 9.2%.

  • Obesity-related conditions include heart disease, stroke, type 2 diabetes and certain types of cancer. These are among the leading causes of preventable, premature death.

  • The estimated annual medical cost of obesity in the United States was $147 billion in 2008. Medical costs for people who had obesity was $1,429 higher than medical costs for people with healthy weight.

2.1 Diabetes mellitus

It was established a long time ago that diabetes mellitus is closely linked to numerous health conditions. If left untreated, diabetes can cause many health complications. Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death. (Kitabchi et al. 2009) Serious long-term complications include cardiovascular disease, chronic kidney disease, foot ulcers, neuropathy, retinopathy, cognitive impairment and others. (Harding et al. 2018)

  • Type 1 diabetes is an idiopathic (fancy word for “we don’t know the reason”) auto-immune disorder with usually an early life onset resulting from failure of the pancreas to produce enough insulin due to loss of beta cells. This form was previously referred to as “insulin-dependent diabetes mellitus” (IDDM) or “juvenile diabetes.”(DiMeglio, Evans-Molina, and Oram 2018)

  • Type 2 diabetes begins with insulin resistance, a condition in which cells fail to respond to insulin properly. As the disease progresses, a lack of insulin may also develop. (Galicia-Garcia et al. 2020)

    Prevention and treatment of type 2 diabetes involves a lot of lifestyle changing measures including but not limited to: maintaining a healthy diet, regular physical exercise, decreasing sedentary time, controlling stress level and mental health support, normal housing environment and sleeping quality, maintaining normal body weight, avoiding use of tobacco. (Kolb and Martin 2017)

  • Gestational diabetes is the third main form, and occurs when pregnant women without a previous history of diabetes develop high blood sugar levels. It is closely related with type 2 and has the same basis - insulin resistance (Plows et al. 2018)

2.2 Insulin resistance (IR)

It is a pathological condition in which cells fail to respond normally to the insulin.

Insulin is a hormone that allows glucose to enter cells which also reduces blood glucose level. Insulin is released by the pancreas in response to carbohydrates consumed in the diet. When higher circulating insulin levels are necessary to achieve the glucose-lowering response, a subject is considered insulin resistant. There are many causes of insulin resistance. (Petersen and Shulman 2018) Insulin resistance is considered a component of metabolic syndrome.

This is a very personal problem for me as despite maintaining normal weight and a relatively healthy lifestyle I have to deal with metabolic syndrome. Keeping a diet low in refined sugars and regular physical exercises helped me drastically increase life quality.

Risk factors

There are a number of risk factors for insulin resistance, including being overweight or obese or having a sedentary lifestyle. Various genetic factors can increase risk, such as a family history of diabetes, and there are some specific medical conditions associated with insulin resistance, such as polycystic ovary syndrome. (Schulze and Hu 2005)

2.3 Specific risk factors (Research question 1 and 2)

2.3.1 Specific risk factors

The National Institute of Diabetes and Digestive and Kidney Diseases state specific risks that may predispose an individual to insulin resistance also include:

  • being aged 45 or older

  • having African American, Alaska Native, American Indian, Asian American, Hispanic/Latino, Native Hawaiian, or Pacific Islander American ethnicity

  • having health conditions such as high blood pressure and abnormal cholesterol levels

  • having a history of gestational diabetes

  • having a history of heart disease or stroke.

In addition some medications and other health conditions can raise the risk.

There are additional topics which I want to include in this part of research:

  • obesity as a risk factor and its relationship with other specific risk factors

  • evaluation of income level as a risk factor

2.3.2 BMI as evaluation of obesity

BMI (Body Mass Index) - is a very convenient and simple form of evaluating obesity.

BMI is defined as the subject’s weight divided by the square of their height and is calculated as follows,

BMI = m / h^2

where m and h are the subject’s weight and height, respectively.

BMI is usually expressed in kilograms of weight per meter squared of height.

The most commonly used definitions, established by the World Health Organization (WHO) in 1997 and published in 2000, provide the values listed in the table.

Category BMI (kg/m2)
Underweight < 18.5
Normal weight 18.5 – 24.9
Overweight 25.0 – 29.9
Obese (Class I) 30.0 – 34.9
Obese (Class II) 35.0 – 39.9
Obese (Class III) ≥ 40.0

As Asian populations develop negative health consequences at a lower BMI than Caucasians, some nations have redefined obesity; Japan has defined obesity as any BMI greater than 25 kg/m2(Kanazawa et al. 2005)while China uses a BMI of greater than 28 kg/m2.(Bei-Fan 2002)

There is a more precise metric - hip to waist ratio, which takes into consideration fat distribution in the body, (Sweeting 2007) but it is impossible to derive from this dataset.

2.3.3 Distribution of obesity by age groups

Explore how overweight and obesity rates differ among age groups.

Variables used:

  • X_bmi5cat: Computed Body Mass Index Categories
  • X_age_g: Imputed Age In Six Groups

  • sex: Respondents Sex

2.3.4 Distribution of obesity by income groups

Explore how overweight and obesity rates differ among income groups.

Variables used:

  • X_bmi5cat: Computed Body Mass Index Categories
  • X_incomg: Computed Income Categories

  • sex: Respondents Sex

2.3.5 Distribution of diabetes by race and age

In this part I will explore age and ethnicity as a risk factor

Variables used:

  • X_mrace1: Calculated Non-Hispanic Race Including Multiracial

  • X_age_g: Imputed Age In Six Groups

  • diabete3: (Ever Told) You Have Diabetes

Value Value Label Frequency Percent Cumulative Frequency
1 Yes 62,345 12.68 12.68
2 Yes, but female told only during pregnancy 4,607 0.94 13.61
3 No 415,394 84.47 98.08
4 No, pre-diabetes or borderline diabetes 8,596 1.75 99.83

As total diabetes score I’m going to use diabetes, pre-diabetes and gestational diabetes combined because of the similar initial mechanism and moreover it is not possible to separate type 1 from type 2 from this dataset. Type 1 diabetes is relatively rare compared to type 2 - around 10-15%, but its incidence is also increasing. (Katsarou et al. 2017)

2.3.6 Distribution of diabetes by race and income levels

In this part I will explore income and ethnicity as a risk factor

Variables used:

  • X_mrace1: Calculated Non-Hispanic Race Including Multiracial
  • X_incomg: Computed Income Categories
  • diabete3: (Ever Told) You Have Diabetes

As a total diabetes score I’m going to use diabetes, pre-diabetes and gestational diabetes combined.

2.3.7 Obesity correlation with diabetes by income levels

In this part I will explore obesity and income as a risk factor

Variables used:

  • diabete3: (Ever Told) You Have Diabetes
  • X_rfbmi5: Overweight Or Obese Calculated Variable
  • X_incomg: Computed Income Categories

2.3.8 Obesity correlation with cardiovascular (CV) diseases by income levels

In this part I will explore CV diseases and income as a risk factor

Variables used:

  • cvdinfr4: Ever Diagnosed With Heart Attack

  • cvdcrhd4: Ever Diagnosed With Angina Or Coronary Heart Disease

  • cvdstrk3: Ever Diagnosed With A Stroke

  • bphigh4: Ever Told Blood Pressure High

As a total CV disease score I’m going to use a combination of these 4 variables.

  • X_rfbmi5: Overweight Or Obese Calculated Variable
  • X_incomg: Computed Income Categories

2.4 Lifestyle risk-factors (Research question 3)

In this chapter I’m going to explore different lifestyle risk factors and their correlation with health

2.4.1 Health benchmark

I would like to use in my analysis a combined health benchmark to incorporate mental health (Number Of Days Mental Health Not Good) and some objectivity (Number Of Days Physical Health Not Good) in general health evaluation. I’m going to include following parameters:

  • genhlth: General Health

  • physhlth: Number Of Days Physical Health Not Good

  • menthlth: Number Of Days Mental Health Not Good

2.4.2 Food habits

I want to calculate daily fruits and vegetable consumption. I will roughly estimate healthy food habits from the amount of fruit portions per day (only whole fruits, as I do not include fresh juice here because of its high sugar low fiber content), green and orange vegetable portions per day (as others can be too starchy) and legumes per week.

  • fruit1: How Many Times Did You Eat Fruit?

  • fvbeans: How Many Times Did You Eat Beans Or Lentils?

  • fvgreen: How Many Times Did You Eat Dark Green Vegetables?

  • fvorang: How Many Times Did You Eat Orange-Colored Vegetables?

Important notion and limitation of the model - it does not include individual requirements. Data regarding analysis of this variables was taken from <https://www.cdc.gov/brfss/data_documentation/pdf/fruits_vegetables.pdf>

The newer Healthy People 2020 objectives are to increase the contribution of fruits and vegetables to the diets of Americans from 0.5 to 0.9 cup equivalent of fruits per 1,000 calories and 0.8 to 1.1 cup equivalent of total vegetables per 1,000 calories.6 Second, national recommendations for fruit and vegetable intake changed from a general 5-a-day recommendation for everyone, to recommendations based on an individual’s age, sex, and physical activity level

2.4.3 Sugar consumption

Excessive caloric intake in the form of a diet high in fast carbs was stated as one of the major risk factors for developing insulin resistance. I’m interested in the evaluation of correlation between sugar consumption and health based on accessible parameters. There isn’t much data available here, as there are just two sugar consumption questions and they are located in the optional part of the questionnaire.

Optional Module 5 - Sugar Drinks

  • ssbsugar: How Often Do You Drink Regular Soda Or Pop?

  • ssbfrut2: How Often Did You Drink Sugar-Sweetened Drinks?

Only fresh fruit juice consumption is located in the main part. As fruit juice is high in sugars and very little fiber I will consider it in this part.

  • fruitju1: How Many Times Did You Drink 100 Percent Pure Fruit Juices?

2.4.4 Physical activity

  • exerany2: Exercise In Past 30 Days

I have chosen the simplest variant of evaluating physical activity, because any type and intensity is better than nothing and I assume that people who do hard physical labor (when excess physical activity should be harmful for health) don’t do extra exercises.

2.4.5 Smoking and alcohol consumption

  • X_rfsmok3: Current Smoking Calculated Variable

I would love to use a numerical variable reflecting the amount of cigarettes smoked a day, but this data isn’t available here.

  • X_drnkmo4: Computed Total Number Drinks A Month

Evaluating alcohol consumption

CDC evaluation of heavy drinking:

Calculated Variables in the 2015 Data File of the Behavioral Risk Factor Surveillance System

Page 28 of 59 May 27, 2016
Section 9: Alcohol Consumption
_RFDRHV5 Calculated variable for heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)

2.4.6 Sleep pattern

Disruption in the sleep pattern is one of major risk factors for development of insulin resistance and diabetes. I will include the amount of sleep hours as my metric, as there is no other data available here.

  • sleptim1: How Much Time Do You Sleep

Part 3: Exploratory data analysis

Part 3.1 Specific risk factors (Research question 1 and 2)

3.1.1 Distribution of obesity by age groups

  • X_bmi5cat: Computed Body Mass Index Categories
  • X_age_g: Imputed Age In Six Groups
  • sex: Respondents Sex
#is obese
dat_q1 <- dat_q1 %>%
  filter(!is.na(X_bmi5cat)) %>% 
  mutate(X_obese = case_when (
                    as.numeric(X_bmi5cat)==4 ~ "Obese",
                    as.numeric(X_bmi5cat)==3 ~ "Overweight",
                    TRUE ~ "Normal or underweight"
  ))

#calculating percentage
for_graph_obese_age <- dat_q1 %>%
  filter(!is.na(X_obese)) %>% 
  group_by(X_age_g,sex,X_obese) %>% 
  summarise(n=n()) %>% 
  group_by(X_age_g,sex) %>% 
  mutate(perc=n/sum(n)) 

#distribution of obesity by age groups and sex
for_graph_obese_age %>%
  ggplot(aes(x = sex, y=perc,  fill = X_obese )) +
  labs( 
       x="", 
       y="percentage",
       title = "Distribution of obesity by age and sex", 
       fill="Weight" 
      ) +
  theme_light() +
  facet_wrap( ~ X_age_g) +
  geom_bar(alpha=1,stat="identity")+
  scale_fill_manual(values=c("#e6f5d0", "#c51b7d", "#4d9221")) +
  geom_text(
      aes(x=sex, y=perc, label = percent(perc,2), vjust=1.5), 
      position = position_fill()
      ) 

There is a noticeable difference between sexes. We can see maximum amount of overweight and obese people in middle age groups. Extensive analysis which will account for skewness and ethnic differences (different BMI evaluation for Asian) is needed.

3.1.2 Distribution of obesity by income groups

  • X_bmi5cat: Computed Body Mass Index Categories
  • X_incomg: Computed Income Categories

  • sex: Respondents Sex

#calculating percentage
for_graph_obese_inc <- dat_q1 %>%
  filter(!is.na(X_incomg), !is.na(X_obese),) %>% 
  group_by(X_incomg,sex,X_obese) %>% 
  summarise(n=n()) %>% 
  group_by(X_incomg,sex) %>% 
  mutate(perc=n/sum(n)) 

#distribution of obesity by income groups and sex
for_graph_obese_inc %>%
  ggplot(aes(x = sex, y=perc,  fill = X_obese )) +
  labs( 
       x="", 
       y="percentage",
       title = "Distribution of obesity by income and sex", 
       fill="Weight" 
      ) +
  theme_light() +
  facet_wrap( ~ X_incomg) +
  geom_bar(alpha=1,stat="identity")+
  scale_fill_manual(values=c("#e6f5d0", "#c51b7d", "#4d9221")) +
  geom_text(
      aes(x=sex, y=perc, label = percent(perc,2), vjust=1.5), 
      position = position_fill()
      ) 

If the overall difference is not big, there is an increasing gap between men and women as income level increases. It would be interesting to look at fresh data, as there has been a new trend in healthy lifestyle recently. Extensive analysis which will account for skewness and ethnic differences (different BMI evaluation for Asian) is needed.

3.1.3 Distribution of diabetes by race and age

  • X_mrace1: Calculated Non-Hispanic Race Including Multiracial

  • X_age_g: Imputed Age In Six Groups

  • diabete3: (Ever Told) You Have Diabetes

#combined diabetes score
dat_q1 <- dat_q1 %>% 
  filter(!is.na(diabete3)) %>% 
  mutate(diabetes_char = case_when( 
                        as.numeric(diabete3) == 3 ~ "No", 
                        TRUE ~ "Yes" 
  )) 

#calculating percentage
for_graph_ethno <- dat_q1 %>%
  filter(!is.na(X_age_g), !is.na(diabetes_char),!is.na(X_mrace1),) %>% 
  group_by(X_age_g,X_mrace1,diabetes_char) %>% 
  summarise(n=n()) %>% 
  group_by(X_age_g,X_mrace1) %>% 
  mutate(perc=n/sum(n)) 

#distribution of the diabetes by race and age
for_graph_ethno %>%
  ggplot(aes(x = X_mrace1, y=perc,  fill = diabetes_char )) +
  labs( 
       x="", 
       y="percentage",
       title = "Distribution of diabetes by race and age", 
       fill="diabetis" 
      ) +
  theme_light() +
  theme(
        axis.text.y   = element_text(size=8),
        axis.text.x = element_text(size=7, angle = 20, vjust = 1, hjust=1),
        axis.title.y  = element_text(size=8),
        axis.title.x  = element_text(size=8),
        ) +
  facet_wrap( ~ X_age_g) +
  geom_bar(alpha=1,stat="identity") +
  scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) +
  geom_text(
      aes(x=X_mrace1, y=perc, label = percent(perc,2), vjust=0, color = diabetes_char), 
      position = position_fill(),
      show.legend = FALSE
      ) +
  scale_colour_manual(values=c("white", "black"))

3.1.4 Distribution of diabetes by race and income levels

  • X_mrace1: Calculated Non-Hispanic Race Including Multiracial
  • X_incomg: Computed Income Categories
  • diabete3: (Ever Told) You Have Diabetes
#calculating percentage
for_graph_income <- dat_q1 %>%
  filter(!is.na(diabetes_char),!is.na(X_mrace1),!is.na(X_incomg )) %>% 
  group_by(X_incomg,X_mrace1,diabetes_char) %>% 
  summarise(n=n()) %>% 
  group_by(X_incomg,X_mrace1) %>% 
  mutate(perc=n/sum(n)) 


#Distribution of diabetes by race and income
for_graph_income %>%
  ggplot(aes(x = X_mrace1, y=perc,  fill = diabetes_char )) +
  labs( 
    x="", 
    y="percentage",
    title = "Distribution of diabetes by race and income", 
    fill="diabetis" 
  ) +
  theme_light() +
  theme(
        axis.text.y   = element_text(size=8),
        axis.text.x = element_text(size=7, angle = 90, vjust = 1, hjust=1),
        axis.title.y  = element_text(size=8),
        axis.title.x  = element_text(size=8),
        ) +
  facet_wrap( ~ X_incomg) +
  geom_bar(alpha=1,stat="identity")+
  scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) +
  geom_text(
      aes(x=X_mrace1, y=perc, label = percent(perc,2), vjust=0, color = diabetes_char), 
      position = position_fill(),
      show.legend = FALSE
      ) +
  scale_colour_manual(values=c("white", "black"))

In the both diagrams similar patterns can be noticed with increased levels for some groups (Native Hawaiian or other Pacific Islander, Black or African American and American Indian or Alaskan Native). But due to serious misrepresentation of these groups it requires adjustments.

Extensive analysis which will account for skewness and weights (account for design of the study) is needed.

3.1.5 Obesity correlation with diabetes

  • X_rfbmi: Overweight Or Obese Calculated Variable

Adults who have a body mass index greater than 25.00 (Overweight or Obese)

  • X_incomg: Computed Income Categories
  • diabete3: (Ever Told) You Have Diabetes
#calculating percentages
for_graph <- dat_q1 %>% 
  select(X_incomg, X_rfbmi5, diabetes_char) %>% 
  filter(!is.na(X_incomg), !is.na(X_rfbmi5), !is.na(diabetes_char)) %>% 
  group_by( X_incomg, X_rfbmi5, diabetes_char) %>% 
  summarise(n=n()) %>% 
  group_by( X_incomg, X_rfbmi5) %>% 
  mutate(perc=n/sum(n)) 
## `summarise()` has grouped output by 'X_incomg', 'X_rfbmi5'. You can override using the `.groups` argument.
#Obesity and CV disease distribution by income groups
for_graph %>%
   ggplot(aes(x = X_rfbmi5, y = perc, fill=diabetes_char, label=perc)) +
    labs( 
      title = "Obesity and diabetes", 
      x="Overweight or obese", 
      y = "percent",  
      fill="Diabetes" 
    ) +
    theme_light() +
    geom_bar(stat="identity" ) +
    facet_wrap(~X_incomg)+
    scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) +
    geom_text(
      aes(x=X_rfbmi5, y=perc, label = percent(perc,2), vjust=0, color = diabetes_char), 
      position = position_fill(),
      show.legend = FALSE
    ) +
    scale_colour_manual(values=c("white", "black"))

Despite almost the same obesity rate among income groups (see part 3.1.2), levels of diabetes differ significantly. From which I can make a conjecture that income level alone could be a significant risk factor for diabetes.

Extensive analysis which will account for skewness and weights (account for design of the study) is needed.

3.1.6 Obesity correlation with CV diseases

  • cvdinfr4: Ever Diagnosed With Heart Attack

  • cvdcrhd4: Ever Diagnosed With Angina Or Coronary Heart Disease

  • cvdstrk3: Ever Diagnosed With A Stroke

  • bphigh4: Ever Told Blood Pressure High

  • X_rfbmi5: Overweight Or Obese Calculated Variable

  • X_incomg: Computed Income Categories

#creating combined CV score
dat_q1 <- dat_q1 %>% 
  filter(!is.na(cvdinfr4),!is.na(cvdcrhd4),!is.na(cvdstrk3)) %>% 
  mutate(heart_char = case_when( 
                        as.numeric(cvdinfr4) == 1 ~ "Yes", 
                        as.numeric(cvdcrhd4) == 1 ~ "Yes",
                        as.numeric(cvdstrk3) == 1 ~ "Yes",
                        as.numeric(bphigh4) == 1 ~ "Yes",
                        TRUE ~ "No" 
  ) %>% factor())    

#calculating percentages
for_graph <- dat_q1 %>% 
  select(X_incomg, X_rfbmi5, heart_char) %>% 
  filter(!is.na(X_incomg), !is.na(X_rfbmi5), !is.na(heart_char)) %>% 
  group_by( X_incomg, X_rfbmi5, heart_char) %>% 
  summarise(n=n()) %>% 
  group_by( X_incomg, X_rfbmi5) %>% 
  mutate(perc=n/sum(n)) 

#Obesity and CV disease distribution by income groups
for_graph %>%
  ggplot(aes(x = X_rfbmi5, y = perc, fill=heart_char, label=perc)) +
    labs( 
      title = "Obesity and CV disease", 
      x="Overweight", 
      y = "percent",  
      fill="CV disease" 
    ) +
    theme_light() +
    geom_bar(stat="identity" ) +
    facet_wrap(~X_incomg) +
    scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) +
    geom_text(
      aes(x=X_rfbmi5, y=perc, label = percent(perc,2), vjust=0, color = heart_char),       position = position_fill(),
      show.legend = FALSE
    ) +
    scale_colour_manual(values=c("white", "black"))

Overall dynamics is the same with diabetes - income level alone could be a significant risk factor for CV diseases.

Extensive analysis which will account for skewness and weights (account for design of the study) is needed.

3.1.7 Conclusion

For conclusion I will make a correlation table of specific risk-factors

#create diabetes numerical score  
dat_q1 <- dat_q1 %>% 
  filter(!is.na(diabete3)) %>% 
  mutate(diabetes_num = case_when ( 
                          as.numeric(diabete3) == 3 ~ 0, 
                          TRUE ~ 1 
                     )  )

#creating numerical CV score
dat_q1 <- dat_q1 %>% 
  filter(!is.na(cvdinfr4),!is.na(cvdcrhd4),!is.na(cvdstrk3)) %>% 
  mutate(heart_num = case_when( 
                        as.numeric(cvdinfr4) == 1 ~ 1, 
                        as.numeric(cvdcrhd4) == 1 ~ 1,
                        as.numeric(cvdstrk3) == 1 ~ 1,
                        as.numeric(bphigh4) == 1 ~ 1,
                        TRUE ~ 0
  ) )  

#apparently there is no separate numeric age variable in the database!
dat_q1 <- dat_q1 %>% 
  filter(!is.na(X_ageg5yr)) %>% 
  mutate(age_num = as.numeric(X_ageg5yr))
   
#creating numerical variable for income level
dat_q1 <- dat_q1 %>% 
  filter(!is.na(income2)) %>% 
  mutate(income_num = as.numeric(income2))
 
#selecting variables for correlation table
m_cor_data <- dat_q1 %>% 
  dplyr::select(
          age_num,
          income_num,
          diabetes_num,
          heart_num,
          X_bmi5
          ) 

#creating correlation table
matrix.cor <- cor(m_cor_data, use = "complete.obs")
corrplot(matrix.cor, method="color",  
         type="upper",
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
          # hide correlation coefficient on the principal diagonal
         diag=FALSE,  number.cex= 7/ncol(matrix.cor)
         )

Here we can see that correlation between age, income level and BMI is very weak. CV diseases correlate most heavily with age as expected, then comes diabetes, BMI and income level.

Part 3.2 Lifestyle risk-factors (Research question 3)

3.2.1 Health benchmark

  • genhlth: General Health

  • physhlth: Number Of Days Physical Health Not Good

  • menthlth: Number Of Days Mental Health Not Good

#normalizing general health
dat_q3 <- dat_q3 %>%
  filter(!is.na(genhlth)) %>% 
  mutate(genhlth_num = case_when( 
                        as.numeric(genhlth) == 1 ~ 1, 
                        as.numeric(genhlth) == 2 ~ 0.75,
                        as.numeric(genhlth) == 3 ~ 0.5,
                        as.numeric(genhlth) == 4 ~ 0.25,
                        as.numeric(genhlth) == 5 ~ 0,
                        TRUE ~ as.numeric(genhlth)
  ))

#percentage of bad physical health days 
dat_q3 <- dat_q3 %>%
  filter(!is.na(physhlth)) %>% 
  mutate(X_health_benchmark = genhlth_num - physhlth/30)

#percentage of bad mental health days
dat_q3 <- dat_q3 %>%
  filter(!is.na(menthlth)) %>% 
  mutate(X_health_benchmark = X_health_benchmark - menthlth/30)

#distribution
dat_q3 %>%
   ggplot() +
    labs( title = "Distribution of health", x="general health" ) +
    theme_light() + 
    geom_density(aes(x=genhlth_num, y=..scaled.., color="General health"), geom="line") +
    geom_density(aes(x=X_health_benchmark, y=..scaled.., color="Combined health benchmark"), geom="line") +
    scale_x_continuous(name='Health score') +
    scale_color_manual(
        name='Health benchmark type',
        values = c("General health"="#c51b7d","Combined health benchmark"="#4d9221")
     ) 
## Warning: Ignoring unknown parameters: geom

## Warning: Ignoring unknown parameters: geom

We can see that the combined general health score is distributed more continuously.

3.2.2 Fruits

fruit1: How Many Times Did You Eat Fruit?

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(fruit1))

#daily fruits consumption
dat_q3 <- dat_q3 %>% 
  mutate(X_fruit_daily = case_when(
                      fruit1 < 200 ~ fruit1 %% 100 ,
                      fruit1 > 200 & fruit1 < 300 ~ fruit1 %% 200 /7,
                      fruit1 > 300 ~ fruit1 %% 300 / 30,
                      TRUE ~ 0
                   )
           ) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_fruit_daily) %>% 
  summarise(
    f_mean = mean(X_fruit_daily), 
    f_median = median(X_fruit_daily), 
    f_sd = sd(X_fruit_daily), 
    f_min = min(X_fruit_daily), 
    f_max = max(X_fruit_daily),
    f_count = n()
  )
##     f_mean f_median     f_sd f_min f_max f_count
## 1 1.019217        1 1.116004     0    99  442030

Statistics looks OK, but seems to be some outliers.

Raw data distribution
#Distribution of fruit daily intake
dat_q3 %>%
  ggplot() +
    labs( title = "Distribution of fruit daily intake", x="fruit daily intake" ) +
    theme_light() + 
    stat_density(aes(x=X_fruit_daily+0.01, y=..scaled..), geom="line",color="#c51b7d") 

Distribution analysis

Distribution looks right skewed with some outliers. As a method to deal with the outliers here and later I’ve chosen to trim up to 99th percentile. More elaborate ways are dependent on the type of distribution and currently out of scope of this preliminary research.

Example of fitness for continuous distribution

#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_fruit_daily <= quantile(dat_q3$X_fruit_daily,0.99))

descdist(dat_q3$X_fruit_daily, discrete = FALSE)

## summary statistics
## ------
## min:  0   max:  4 
## median:  1 
## mean:  0.9753597 
## estimated sd:  0.8302213 
## estimated skewness:  1.337087 
## estimated kurtosis:  4.609642

For the sake of simplicity for now I will only use one model (chosen as the best guess for now). Later extended analysis will be added.

Distribution fitness

Example - fitness for normal

fit_X_fruit_daily.norm <- fitdist(dat_q3$X_fruit_daily, "norm")
plot(fit_X_fruit_daily.norm)

Mix of continuous and discrete distribution. Extended analysis will be needed later.

Final distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of fruit daily intake", 
      x="fruit daily intake", 
      color="Age groups" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_fruit_daily, y=..scaled.., color=X_age_g), geom="line") + 
    scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_fruit_daily, y=X_health_benchmark, color=X_age_g )) +
   labs( x="fruit daily intake", title = "health score and fruit daily intake", color="age" ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ sex)+ 
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear. Positive on a continuous interval [0,1] . Extended analysis will be needed.

3.2.3 Green vegetables

fvgreen: How Many Times Did You Eat Dark Green Vegetables?

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(fvgreen))

#daily green vegetables consumption
dat_q3 <- dat_q3 %>% 
  mutate(X_vege_green_daily = case_when(
                      fvgreen < 200 ~ fvgreen %% 100 ,
                      fvgreen > 200 & fvgreen < 300 ~ fvgreen %% 200 /7,
                      fvgreen > 300 ~ fvgreen %% 300 / 30,
                      TRUE ~ 0
                   )
           ) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_vege_green_daily) %>% 
  summarise(
    v_mean = mean(X_vege_green_daily), 
    v_median = median(X_vege_green_daily), 
    v_sd = sd(X_vege_green_daily), 
    v_min = min(X_vege_green_daily), 
    v_max = max(X_vege_green_daily),
    v_count = n()
  )
##      v_mean  v_median      v_sd v_min v_max v_count
## 1 0.5450731 0.4285714 0.5951431     0    99  433474

Checking distribution

Raw data distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of green vegetables daily intake", 
      x="green vegetables daily intake", 
      color="Age groups" 
    ) +
    theme_light() +    
    stat_density(aes(x=X_vege_green_daily+0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = log10_trans(),
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))) + 
    scale_color_brewer(palette = "Set1")

Distribution analysis

Distribution looks right skewed with some outliers. Example of fitness for discrete distribution

#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_vege_green_daily <= quantile(dat_q3$X_vege_green_daily,0.99))

descdist(dat_q3$X_vege_green_daily, discrete = TRUE)

## summary statistics
## ------
## min:  0   max:  3 
## median:  0.4285714 
## mean:  0.531312 
## estimated sd:  0.496137 
## estimated skewness:  1.859271 
## estimated kurtosis:  7.97367
Distribution fitness
fit_X_vege_green.exp <- fitdist(dat_q3$X_vege_green_daily, "exp")
plot(fit_X_vege_green.exp)

Mix of continuous and discrete distribution. Extended analysis will be needed later.

Final distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of green vegetables daily intake", 
      x="green vegetables daily intake", 
      color="Age groups" 
    ) +
    theme_light() + 
     stat_density(aes(x=X_vege_green_daily, y=..scaled.., color=X_age_g), geom="line") + 
    scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_vege_green_daily, y=X_health_benchmark, color=X_age_g )) +
   labs( 
     x="green vegetables intake", 
     title = "health score and green vegetables daily intake", 
     color="age" 
    ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ sex )+ 
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear. Positive on a continuous interval [0,1] . Extended analysis will be needed.

3.2.4 Orange vegetables

fvorang: How Many Times Did You Eat Orange-Colored Vegetables?

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(fvorang))

#daily orange vegetables consumption
dat_q3 <- dat_q3 %>% 
  mutate(X_vege_orange_daily = case_when(
                      fvorang < 200 ~ fvorang %% 100 ,
                      fvorang > 200 & fvorang < 300 ~ fvorang %% 200 /7,
                      fvorang > 300 ~ fvorang %% 300 / 30,
                      TRUE ~ 0
                   )
           ) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_vege_orange_daily) %>% 
  summarise(
    v_mean = mean(X_vege_orange_daily), 
    v_median = median(X_vege_orange_daily), 
    v_sd = sd(X_vege_orange_daily), 
    v_min = min(X_vege_orange_daily), 
    v_max = max(X_vege_orange_daily),
    v_count = n()
  )
##     v_mean  v_median      v_sd v_min v_max v_count
## 1 0.289208 0.1666667 0.3614312     0    25  428900

Statistic looks OK, but 25 seems too much for a daily intake

Raw data distribution
#checking distribution 
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of orange vegetables daily intake", 
      x="orange vegetables daily intake", 
      color="Age groups"
    ) +
    theme_light() +
    stat_density(aes(x=X_vege_orange_daily+0.01, y=..scaled.., color=X_age_g), geom="line") + 
    scale_color_brewer(palette = "Set1")

Distribution analysis

Distribution looks right skewed with some outliers.

#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_vege_orange_daily <= quantile(dat_q3$X_vege_orange_daily,0.99))

descdist(dat_q3$X_vege_orange_daily, discrete = FALSE)

## summary statistics
## ------
## min:  0   max:  2 
## median:  0.1666667 
## mean:  0.2801344 
## estimated sd:  0.3123573 
## estimated skewness:  2.174206 
## estimated kurtosis:  9.801851
Distribution fitness
fit_X_vege_orange.exp <- fitdist(dat_q3$X_vege_orange_daily, "exp")

plot(fit_X_vege_orange.exp)

Mix of continuous and discrete distribution. Extended analysis will be needed later.

Final distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of orange vegetables daily intake", 
      x="orange vegetables daily intake", 
      color="Age groups" 
    ) +
    theme_light() +
    stat_density(aes(x=X_vege_orange_daily, y=..scaled.., color=X_age_g), geom="line") +
   scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_vege_orange_daily, y=X_health_benchmark, color=X_age_g )) +
   labs( 
     x="orange vegetables intake", 
     title = "health score and orange vegetables daily intake", 
     color="age" 
    ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ sex )+ 
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear. Positive on an interval [0,1] . Extended analysis will be needed.

3.2.5 Beans

fvbeans: How Many Times Did You Eat Beans Or Lentils?

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(fvbeans))

#weekly beans consumption
dat_q3 <- dat_q3 %>% 
  mutate(X_beans_weekly = case_when(
                      fvbeans < 200 ~ fvbeans %% 100 * 7,
                      fvbeans > 200 & fvbeans < 300 ~ fvbeans %% 200  ,
                      fvbeans > 300 ~ fvbeans %% 300 / 4,
                      TRUE ~ 0
                   )
           ) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_beans_weekly) %>% 
  summarise(
    b_mean = mean(X_beans_weekly), 
    b_median = median(X_beans_weekly), 
    b_sd = sd(X_beans_weekly), 
    b_min = min(X_beans_weekly), 
    b_max = max(X_beans_weekly),
    b_count = n()
  ) 
##     b_mean b_median     b_sd b_min b_max b_count
## 1 1.963347        1 2.652742     0   385  423463
Raw data distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of beans weekly intake", 
      x="bean weekly intake", 
      color="Age groups" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_beans_weekly+0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = log10_trans(),
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))) + 
    scale_color_brewer(palette = "Set1")

Distribution analysis

Distribution looks right skewed with some outliers.

#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_beans_weekly <= quantile(dat_q3$X_beans_weekly,0.99))

descdist(dat_q3$X_beans_weekly, discrete = FALSE)

## summary statistics
## ------
## min:  0   max:  14 
## median:  1 
## mean:  1.888232 
## estimated sd:  2.169395 
## estimated skewness:  2.306955 
## estimated kurtosis:  10.75185
Distribution fitness
fit_X_beans_weekly.exp <- fitdist(dat_q3$X_beans_weekly, "exp")

plot(fit_X_beans_weekly.exp)

Mix of continuous and discrete distribution. Extended analysis will be needed later.

Final distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of beans weekly intake", 
      x="bean weekly intake", 
      color="Age groups" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_beans_weekly, y=..scaled.., color=X_age_g), geom="line") + 
    scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_beans_weekly, y=X_health_benchmark, color=X_age_g )) +
   labs( x="beans weekly intake", title = "health score and beans intake", color="age" ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ X_age_g)+ 
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear with optimal point around 2

3.2.6 Physical activity

exerany2: Exercise In Past 30 Days

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(exerany2)) 
Data distribution
#calculating percentage inside age groups
for_graph <- dat_q3 %>% 
  group_by( X_age_g, exerany2) %>% 
  summarise(n=n()) %>% 
  group_by( X_age_g) %>% 
  mutate(perc=n/sum(n)) 

#distribution inside age groups
for_graph %>%
   ggplot(aes(x = X_age_g, y = perc, fill=exerany2)) +
    labs( 
      title = "Exercising distribution in age groups", 
      x="", 
      y = "percent",  
      fill="Exercise" ) +
    theme_light() +
    geom_bar(stat="identity" ) +
    scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) 

Health benchmark correlation
dat_q3 %>%
  ggplot(aes(x=exerany2, y=X_health_benchmark, color=X_age_g )) +
    labs( x="exercising", title = "health score and exercising", color="age" ) +
    theme_light() +
    geom_boxplot(alpha=0) +
    facet_wrap( ~ sex )+
    scale_color_brewer(palette = "Set1")

Noticeable positive correlation in all groups.

3.2.7 Alcohol consumption

X_drnkmo4: Computed Total Number Drinks A Month

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(X_drnkmo4)) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_drnkmo4) %>% 
  summarise(
    a_mean = mean(X_drnkmo4), 
    a_median = median(X_drnkmo4), 
    a_sd = sd(X_drnkmo4), 
    a_min = min(X_drnkmo4), 
    a_max = max(X_drnkmo4),
    a_count = n()
  )
##     a_mean a_median   a_sd a_min a_max a_count
## 1 11.32152        1 32.436     0  2280  413104

It looks OK, but 2280 seems like a too festive life even for Russian, so checking distribution

Raw data distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of alcohol monthly intake", 
      x="alcohol monthly intake", 
      color="Age groups" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_drnkmo4 + 0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = log10_trans(),
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))) +
    facet_wrap(~sex) +
    scale_color_brewer(palette = "Set1")

Normalizing to CDC definition as heavy drinking

#normalizing according to CDC recommendations
dat_q3 <- dat_q3 %>%
  mutate(X_alcohol_num = case_when( 
                        as.numeric(sex) == 1 ~ (X_drnkmo4 / 60), 
                        as.numeric(sex) == 2 ~ (X_drnkmo4 / 30),
                        TRUE ~ 0,
  ))
Distribution analysis
#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_alcohol_num <= quantile(dat_q3$X_alcohol_num,0.99))

descdist(dat_q3$X_alcohol_num, discrete = FALSE)

## summary statistics
## ------
## min:  0   max:  2.666667 
## median:  0.01666667 
## mean:  0.2132541 
## estimated sd:  0.4083644 
## estimated skewness:  2.756143 
## estimated kurtosis:  11.34199
Distribution fitness
#to use beta needs scaling into the range [0,1]
X_alcohol_num_scaled <- 
  (dat_q3$X_alcohol_num - min(dat_q3$X_alcohol_num) + 0.001) / 
  (max(dat_q3$X_alcohol_num) - min(dat_q3$X_alcohol_num) + 0.002)

fit_X_alcohol_num_scaled.beta <- fitdist(X_alcohol_num_scaled, "beta")

plot(fit_X_alcohol_num_scaled.beta)

Mix of continuous and discrete distribution. Extended analysis will be needed later.

Final distribution
#Normalized Distribution of alcohol intake
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Normalized Distribution of alcohol intake", 
      x="normalized alcohol intake", 
      color="Age groups"
    ) +
    theme_light() + 
    stat_density(aes(x=X_alcohol_num, y=..scaled.., color=X_age_g), geom="line") +
    facet_wrap(~sex) +
    scale_color_brewer(palette = "Set1")

Health benchmark
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_alcohol_num, y=X_health_benchmark, color=X_age_g )) +
   labs( x="normalized alcohol consumption", title = "health score and alcohol", color="age" ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ X_age_g )+
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear. Very tempting to say about optimal value here =)))) But it could be due to survival bias.

3.2.8 Smoking

X_rfsmok3: Current Smoking Calculated Variable

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(X_rfsmok3)) 
Data distribution
#calculating percentage for smoking distribution inside age groups
for_graph <- dat_q3 %>% 
  group_by( X_age_g, X_rfsmok3) %>% 
  summarise(n=n()) %>% 
  group_by( X_age_g) %>% 
  mutate(perc=n/sum(n)) 

#Smoking distribution inside age groups
for_graph %>%
   ggplot(aes(x = X_age_g, y = perc, fill=X_rfsmok3)) +
      labs( 
        title = "Smoking distribution in age groups", 
        x="", 
        y = "percent",  
        fill="smoking" 
      ) +
      theme_light() +
      geom_bar(stat="identity" ) +
    scale_fill_manual(values=c("#e6f5d0", "#c51b7d")) 

Health benchmark correlation
dat_q3 %>%
  filter(X_health_benchmark>-2) %>%
  ggplot(aes(x=X_rfsmok3, y=X_health_benchmark, color=X_age_g )) +
   labs( x="smoking", title = "health score and smoking", color="age" ) +
    theme_light() +
    geom_boxplot(alpha=0) +
    facet_wrap( ~ sex )+
    scale_color_brewer(palette = "Set1")

Noticeable negative correlation, more prominent for females.

3.2.9 Sleep quality

sleptim1: How Much Time Do You Sleep

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(sleptim1))

#checking for outliers
dat_q3 %>% 
dplyr::select(sleptim1) %>% 
  summarise(
    slp_mean = mean(sleptim1), 
    slp_median = median(sleptim1), 
    slp_sd = sd(sleptim1), 
    slp_min = min(sleptim1), 
    slp_max = max(sleptim1),
    slp_count = n()
  )
##   slp_mean slp_median   slp_sd slp_min slp_max slp_count
## 1 7.045731          7 1.427169       1      24    403995
Raw data distribution
dat_q3 %>%
  ggplot() +
    labs( 
      title = "Distribution of sleep hours daily", 
      x="sleeping hours", 
      color="Age" 
    ) +
    theme_light() + 
    stat_density(aes(x=sleptim1+0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = "log10",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)))  +
    scale_color_brewer(palette = "Set1")

Final distribution
#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(sleptim1 <= quantile(dat_q3$sleptim1,0.99))

#normalizing to 6 hours a day 
dat_q3 <- dat_q3 %>%
  mutate(X_sleptim_norm = sleptim1/6)

dat_q3 %>%
  ggplot() +
    labs( title = "Distribution of sleeping hours daily", x="sleeping hours", color="Age groups" ) +
    theme_light() + 
    stat_density(aes(x=X_sleptim_norm, y=..scaled.., color=X_age_g), geom="line") +
    scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_sleptim_norm, y=X_health_benchmark, color=X_age_g )) +
   labs( x="sleeping time normalized", title = "health score and sleep", color="age" ) +
   theme_light() +
   geom_smooth() +
   facet_wrap( ~ sex)+
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The relationship is not linear with an optimal value between 1 and 1,5.

3.2.10 Sugars - fruit juice

fruitju1: How Many Times Did You Drink 100 Percent Pure Fruit Juices?

Data cleaning
#cleaning NA values
dat_q3 <- dat_q3 %>%
  filter(!is.na(fruitju1)) 

#daily juice consumption
dat_q3 <- dat_q3 %>% 
  mutate(X_sugars_juice = case_when(
                      fruitju1 < 200 ~ fruitju1 %% 100,
                      fruitju1 > 200 & fruitju1 < 300 ~ fruitju1 %% 200 / 7,
                      fruitju1 > 300 ~ fruitju1 %% 300 / 30,
                      TRUE ~ 0
                   )
           ) 

#checking for outliers
dat_q3 %>% 
dplyr::select(X_sugars_juice) %>% 
  summarise(
    sugar_mean = mean(X_sugars_juice), 
    sugar_median = median(X_sugars_juice), 
    sugar_sd = sd(X_sugars_juice), 
    sugar_min = min(X_sugars_juice), 
    sugar_max = max(X_sugars_juice), 
    sugar_count = n() 
  )
##   sugar_mean sugar_median  sugar_sd sugar_min sugar_max sugar_count
## 1  0.3810245    0.1333333 0.6432528         0        99      398386
Raw data distribution
dat_q3 %>%
  ggplot(aes(x = X_sugars_juice,color=X_age_g )) +
    labs( 
      title = "Distribution of daily juice consumption", 
      x="daily juice consumption", 
      color="age group" 
    ) +
    theme_light()    +
    stat_density(aes(x=X_sugars_juice+0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = log10_trans(),
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)))   +
    scale_color_brewer(palette = "Set1")

Right skewed with outliers

Final distribution
#Cutting up to the 99th percentile 
dat_q3 <- dat_q3 %>%
  filter(X_sugars_juice <= quantile(dat_q3$X_sugars_juice,0.99))

dat_q3 %>%
  ggplot(aes(x = X_sugars_juice, color = X_age_g )) +
    labs( 
      title = "Distribution of daily juice consumption", 
      x="daily juice consumption", 
      color="age group" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_sugars_juice, y=..scaled.., color=X_age_g), geom="line")+
    scale_color_brewer(palette = "Set1") 

Health benchmark correlation
dat_q3 %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_sugars_juice, y=X_health_benchmark, color=X_age_g )) +
   labs( x="juice consumption", title = "health score and fresh juice", color="age" ) +
    theme_light() +
    geom_smooth() +
    facet_wrap( ~ X_age_g + sex )+
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Nonlinear for some groups, very mixed results, will not include in total sugars, but add as a separate unit in the model.

3.2.11 Sugars - regular soda

ssbsugar: How Often Do You Drink Regular Soda Or Pop?

ssbsugar and ssbsugar are from Optional Module, thus there are not many data records available. So I have to create a separate dataset.

Data cleaning
#cleaning NA values 
dat_q3_sugar <- dat_q3 %>%
  filter(!is.na(ssbsugar)) 

#daily soda consumption
dat_q3_sugar <- dat_q3_sugar %>% 
  mutate(X_sugars_soda = case_when(
                      ssbsugar < 200 ~ ssbsugar %% 100,
                      ssbsugar > 200 & ssbsugar < 300 ~ ssbsugar %% 200 / 7,
                      ssbsugar > 300 ~ ssbsugar %% 300 / 30,
                      TRUE ~ 0
                   )
           )  

#checking for outliers
dat_q3_sugar %>% 
dplyr::select(X_sugars_soda) %>% 
  summarise(
    sugar_mean = mean(X_sugars_soda), 
    sugar_median = median(X_sugars_soda), 
    sugar_sd = sd(X_sugars_soda), 
    sugar_min = min(X_sugars_soda), 
    sugar_max = max(X_sugars_soda), 
    sugar_count = n() 
  )
##   sugar_mean sugar_median  sugar_sd sugar_min sugar_max sugar_count
## 1  0.3571433   0.03333333 0.8879977         0        40       90005
Raw data distribution
dat_q3_sugar %>%
  ggplot(aes(x = X_sugars_soda,color=X_age_g )) +
    labs( 
      title = "Distribution of daily soda consumption", 
      x="daily soda consumption", 
      color="age group"
    ) +
    theme_light() + 
    stat_density(aes(x=X_sugars_soda, y=..scaled.., color=X_age_g), geom="line") +
    scale_color_brewer(palette = "Set1")

Right skewed with outliers

Final distribution
#Cutting up to the 99th percentile
dat_q3_sugar <- dat_q3_sugar %>%
  filter(X_sugars_soda <= quantile(dat_q3_sugar$X_sugars_soda,0.99))

dat_q3_sugar %>%
  ggplot(aes(x = X_sugars_soda,color=X_age_g )) +
    labs( 
      title = "Distribution of daily soda consumption", 
      x="daily soda consumption", 
      color="age group" 
    ) +
    theme_light() + 
    stat_density(aes(x=X_sugars_soda, y=..scaled.., color=X_age_g), geom="line") +
    scale_color_brewer(palette = "Set1")

Righ

Health benchmark correlation
dat_q3_sugar %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_sugars_soda, y=X_health_benchmark, color=X_age_g )) +
   labs( x="soda consumption", title = "health score and soda", color="age" ) +
    theme_light() +
    geom_smooth(method = "lm") +
    facet_wrap( ~ X_age_g )+
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using formula 'y ~ x'

There is an observable negative correlation

3.2.12 Sugars - other sweetened drinks

ssbfrut2: How Often Did You Drink Sugar-Sweetened Drinks?

Data cleaning
#cleaning NA values
dat_q3_sugar <- dat_q3_sugar %>%
  filter(!is.na(ssbfrut2)) 

#daily sweetened drinks consumption
dat_q3_sugar <- dat_q3_sugar %>% 
  mutate(X_sugars_other = case_when(
                      ssbfrut2 < 200 ~ ssbfrut2 %% 100,
                      ssbfrut2 > 200 & ssbfrut2 < 300 ~ ssbfrut2 %% 200 / 7,
                      ssbfrut2 > 300 ~ ssbfrut2 %% 300 / 30,
                      TRUE ~ 0
                   )
           )  

#checking for outliers
dat_q3_sugar %>% 
dplyr::select(X_sugars_other) %>% 
  summarise(
    sugar_mean = mean(X_sugars_other), 
    sugar_median = median(X_sugars_other), 
    sugar_sd = sd(X_sugars_other), 
    sugar_min = min(X_sugars_other), 
    sugar_max = max(X_sugars_other), 
    sugar_count = n()
  )
##   sugar_mean sugar_median sugar_sd sugar_min sugar_max sugar_count
## 1  0.2810607            0 0.747307         0        60       89058
Raw data distribution
dat_q3_sugar %>%
  ggplot() +
    labs( 
      title = "Distribution of sweetened drinks consumption", 
      x="sweetened drinks consumption", 
      color="age group" 
    ) +
    theme_light() +
     stat_density(aes(x=X_sugars_other +0.01, y=..scaled.., color=X_age_g), geom="line") +
    scale_x_continuous(trans = log10_trans(),
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)))   +
    scale_color_brewer(palette = "Set1")

Right skewed with outliers

Final distribution
#Cutting up to the 99th percentile
dat_q3_sugar <- dat_q3_sugar %>%
  filter(X_sugars_other <= quantile(dat_q3_sugar$X_sugars_other,0.99))

#distribution
dat_q3_sugar %>%
  ggplot() +
    labs( 
      title = "Distribution of sweetened drinks consumption", 
      x="sweetened drinks consumption", 
      color="age group" 
    ) +
    theme_light() +
    stat_density(aes(x=X_sugars_other, y=..scaled.., color=X_age_g), geom="line")+
    scale_color_brewer(palette = "Set1")

Health benchmark correlation
dat_q3_sugar %>%
  filter(!is.na(X_health_benchmark)) %>%
  ggplot(aes(x=X_sugars_other, y=X_health_benchmark, color=X_age_g )) +
   labs( 
     x="sweetened drinks consumption", 
     title = "health score and sweetened drinks", 
     color="age" 
    ) +
    theme_light() +
    geom_smooth(method = "lm") +
    facet_wrap( ~ X_age_g, scales="free_x" )+
    scale_color_brewer(palette = "Set1")
## `geom_smooth()` using formula 'y ~ x'

There is an observable negative correlation

3.2.13 Correlation tables

I don’t have enough knowledge for now how to use complicated non-linear models and inference, so I will use preliminary correlation analysis

Sugars, health benchmark and BMI correlation

For the sugar consumption I will take soda and other sweetened drinks.

#combining all sugars
dat_q3_sugar <- dat_q3_sugar %>% 
  mutate(X_sugars_total = (  X_sugars_soda + X_sugars_other ))

#checking distribution 
dat_q3_sugar %>%
  ggplot() +
  labs( 
    title = "Distribution of sugary drinks consumption", 
    x="sugary drinks consumption", 
    color="age group" 
  ) +
  theme_light() + 
  stat_density(aes(x=X_sugars_total, y=..scaled.., color=X_age_g), geom="line") +
  scale_color_brewer(palette = "Set1")

Example of the output of the model for male (sex=1) age 45 to 54 (X_age_g=4).

#create diabetes numerical var
dat_q3_sugar <- dat_q3_sugar %>% 
  filter(!is.na(diabete3)) %>% 
  mutate(diabetes_num = case_when( 
                        as.numeric(diabete3) == 3 ~ 0, 
                        TRUE ~ 1 
  ) )  

#selecting data for the model
m_cor_data <- dat_q3_sugar %>% 
  filter(as.numeric(X_age_g)==4,as.numeric(sex)==2,!is.na(X_health_benchmark)) %>%
  dplyr::select(
          X_health_benchmark, 
          X_sugars_total, 
          X_sugars_juice,
          diabetes_num,
          X_bmi5
          ) 

#creating correlation table
matrix.cor <- cor(m_cor_data, use = "complete.obs")
corrplot(matrix.cor, method="color",  
         type="upper",
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
          # hide correlation coefficient on the principal diagonal
         diag=FALSE,  number.cex= 7/ncol(matrix.cor)
         )

About negative correlation and diabetes: it is most likely due to the observational nature of the study - it is expected behavior from diabetic to stop consuming sugary drinks and other fast carbs. What is surprising for me - low correlation of BMI and sugar consumption.

Other parameters correlation

Example of the output of the model for male (sex=1) age 45 to 54 (X_age_g=4). I didn’t include BMI to the lifestyle score but I include it here

#create diabetes numerical var
dat_q3 <- dat_q3 %>% 
  filter(!is.na(diabete3)) %>% 
  mutate(diabetes_num = case_when( 
                        as.numeric(diabete3) == 3 ~ 0, 
                        TRUE ~ 1 
  ) )  

#physical activity numerical var
dat_q3 <- dat_q3 %>%
  mutate(X_exe_num = case_when( 
                        exerany2 == "Yes" ~ 1, 
                        exerany2 == "No" ~ 0 
  ))

#smoking numerical var
dat_q3 <- dat_q3 %>%
  mutate(X_smoking_num = case_when( 
                        X_rfsmok3 == "Yes" ~ 1, 
                        X_rfsmok3 == "No" ~ 0 
  ))

#selecting data for the model
m_cor_data <- dat_q3 %>% 
  filter(as.numeric(X_age_g)==4,as.numeric(sex)==1,!is.na(X_health_benchmark) ) %>%
  dplyr::select(
          X_health_benchmark, 
          X_bmi5,
          diabetes_num,
          X_vege_green_daily,
          X_vege_orange_daily,
          X_fruit_daily,
          X_beans_weekly,
          X_exe_num, 
          X_smoking_num,
          X_alcohol_num,
          X_sleptim_norm,

          ) 
#creating correlation table
matrix.cor <- cor(m_cor_data, use = "complete.obs")
corrplot(matrix.cor, method="color",  
         type="upper",
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
          # hide correlation coefficient on the principal diagonal
         diag=FALSE,  number.cex= 7/ncol(matrix.cor)
         )

3.3.14 Conclusion

  1. On almost all parameters related to food habits correlation is pretty weak. I think one reason for this may be that data is taken from an epidemiological observational study, thus it can be the other way around - people with health problems switching to a more healthy lifestyle in order to mitigate their problem. Low correlation of BMI with almost every lifestyle related variable was a surprise for me though, it needs additional exploration in the future.

    There are a lot of non-linear relationships which require extensive analysis, that is out of the scope of this study. The underlying reason may be non-linearity of response in biological systems (hormetic property) - too much of a good thing (physical activity for example) can be bad and sometimes even a bad thing in a small amount can be beneficial (my hope for the case with alcohol =))) In the case of fresh fruit juices and beans - could be some mixed impact of high amount of sugar, fiber and vitamins. Some equivocal results could also be due to the observational nature of the study (survival bias).
Bei-Fan, Zhou. 2002. “Predictive Values of Body Mass Index and Waist Circumference for Risk Factors of Certain Related Diseases in Chinese Adults: Study on Optimal Cut-Off Points of Body Mass Index and Waist Circumference in Chinese Adults.” Asia Pacific Journal of Clinical Nutrition 11 (December): S685–93. https://doi.org/10.1046/j.1440-6047.11.s8.9.x.
Bray, George A. 2004. “Medical Consequences of Obesity.” The Journal of Clinical Endocrinology & Metabolism 89 (6): 2583–89. https://doi.org/10.1210/jc.2004-0535.
DiMeglio, Linda A, Carmella Evans-Molina, and Richard A Oram. 2018. “Type 1 Diabetes.” The Lancet 391 (10138): 2449–62. https://doi.org/10.1016/s0140-6736(18)31320-5.
Galicia-Garcia, Unai, Asier Benito-Vicente, Shifa Jebari, Asier Larrea-Sebal, Haziq Siddiqi, Kepa B. Uribe, Helena Ostolaza, and César Martín. 2020. “Pathophysiology of Type 2 Diabetes Mellitus.” International Journal of Molecular Sciences 21 (17): 6275. https://doi.org/10.3390/ijms21176275.
Grundy, Scott M. 2004. “Obesity, Metabolic Syndrome, and Cardiovascular Disease.” The Journal of Clinical Endocrinology & Metabolism 89 (6): 2595–2600. https://doi.org/10.1210/jc.2004-0372.
Harding, Jessica L., Meda E. Pavkov, Dianna J. Magliano, Jonathan E. Shaw, and Edward W. Gregg. 2018. “Global Trends in Diabetes Complications: A Review of Current Evidence.” Diabetologia 62 (1): 3–16. https://doi.org/10.1007/s00125-018-4711-2.
Haslam, David W, and W Philip T James. 2005. “Obesity.” The Lancet 366 (9492): 1197–1209. https://doi.org/10.1016/s0140-6736(05)67483-1.
Kanazawa, Masao, Nobuo Yoshiike, Toshimasa Osaka, Yoshio Numba, Paul Zimmet, and Shuji Inoue. 2005. “Criteria and Classification of Obesity in Japan and Asia-Oceania.” In, 1–12. KARGER. https://doi.org/10.1159/000088200.
Katsarou, Anastasia, Soffia Gudbjörnsdottir, Araz Rawshani, Dana Dabelea, Ezio Bonifacio, Barbara J. Anderson, Laura M. Jacobsen, Desmond A. Schatz, and Åke Lernmark. 2017. “Type 1 Diabetes Mellitus.” Nature Reviews Disease Primers 3 (1). https://doi.org/10.1038/nrdp.2017.16.
Kitabchi, A. E., G. E. Umpierrez, J. M. Miles, and J. N. Fisher. 2009. “Hyperglycemic Crises in Adult Patients With Diabetes.” Diabetes Care 32 (7): 1335–43. https://doi.org/10.2337/dc09-9032.
Kolb, Hubert, and Stephan Martin. 2017. “Environmental/Lifestyle Factors in the Pathogenesis and Prevention of Type 2 Diabetes.” BMC Medicine 15 (1). https://doi.org/10.1186/s12916-017-0901-x.
Petersen, Max C., and Gerald I. Shulman. 2018. “Mechanisms of Insulin Action and Insulin Resistance.” Physiological Reviews 98 (4): 2133–223. https://doi.org/10.1152/physrev.00063.2017.
Plows, Jasmine, Joanna Stanley, Philip Baker, Clare Reynolds, and Mark Vickers. 2018. “The Pathophysiology of Gestational Diabetes Mellitus.” International Journal of Molecular Sciences 19 (11): 3342. https://doi.org/10.3390/ijms19113342.
Schneider, Karen L, Melissa A Clark, William Rakowski, and Kate L Lapane. 2010. “Evaluating the Impact of Non-Response Bias in the Behavioral Risk Factor Surveillance System (BRFSS).” Journal of Epidemiology and Community Health 66 (4): 290–95. https://doi.org/10.1136/jech.2009.103861.
Schulze, Matthias B., and Frank B. Hu. 2005. “PRIMARY PREVENTION OF DIABETES: What Can Be Done and How Much Can Be Prevented?” Annual Review of Public Health 26 (1): 445–67. https://doi.org/10.1146/annurev.publhealth.26.021304.144532.
Sweeting, Helen N. 2007. “Measurement and Definitions of Obesity In Childhood and Adolescence: A Field Guide for the Uninitiated.” Nutrition Journal 6 (1). https://doi.org/10.1186/1475-2891-6-32.